Fix liveness and readiness probes #396

timuthy · 2022-08-05T10:56:27Z

How to categorize this PR?

/area quality
/kind bug

What this PR does / why we need it:
This PR fixes several issues for the currently used liveness and readiness checks and also adds a startup probe for single- and multi-node etcds.

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:
Issue with current liveness probe:

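When multiple arguments are passed to /bin/sh -ec, only the first one is treated as the command string; the remaining arguments become $0 and the positional parameters. The probe therefore only runs /bin/sh -ec ETCDCTL_API=3, which always succeeds, and etcdctl is never executed.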

            - /bin/sh
            - -ec
            - ETCDCTL_API=3
            - etcdctl
            - --cert=/var/etcd/ssl/client/client/tls.crt
            - --key=/var/etcd/ssl/client/client/tls.key
            - --cacert=/var/etcd/ssl/client/ca/ca.crt
            - --endpoints=https://etcd-aws-local:2379/
            - get
            - foo
            - --consistency=s
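
The behaviour can be reproduced without etcd. The following is a minimal sketch, not the probe from this PR: `false` stands in for a failing etcdctl call (an assumption made purely for the demo), and the two status variables are illustrative names.

```shell
#!/bin/sh
# With `sh -c`, only the first operand after the options is the command string.
# The next operand becomes $0 and any further operands become $1, $2, ...
# `false` is a stand-in for a failing etcdctl invocation (demo assumption).

# Broken form: the command string is just the variable assignment,
# so `false` is never executed and the shell exits successfully.
broken_status=0
/bin/sh -ec 'ETCDCTL_API=3' false || broken_status=$?

# Fixed form: the whole command line is passed as one string,
# so the failing command runs and its exit status is propagated.
fixed_status=0
/bin/sh -ec 'ETCDCTL_API=3 false' || fixed_status=$?

echo "broken form exit status: $broken_status"
echo "fixed form exit status:  $fixed_status"
```

The broken form reports exit status 0 even though the stand-in command would fail; the fixed form reports 1.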

Issue with current readiness probe:

The exec command did not evaluate the status code of the HTTP response: curl exits with 0 as long as it receives a response, so the container was considered ready even when the /health(z) endpoint returned a status other than 200.

            - /usr/bin/curl
            - --cert
            - /var/etcd/ssl/client/client/tls.crt
            - --key
            - /var/etcd/ssl/client/client/tls.key
            - --cacert
            - /var/etcd/ssl/client/ca/ca.crt
            - https://etcd-aws-local:8080/healthz

For single-node etcd, switching to an HTTP probe solves this issue.

For multi-node etcd, an exec probe is still necessary because the /health endpoint of etcd is protected by mutual TLS (also see etcd-io/etcd#12370) and Kubernetes HTTP probes do not support providing a client certificate.
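
For the exec case, one way to make the HTTP status count is to have curl print the status code and compare it explicitly (curl's --fail option is an alternative). The sketch below is an assumption about how such a check could look, not the command introduced by this PR; `is_healthy_status` and `probe_health` are hypothetical helper names.

```shell
#!/bin/sh
# is_healthy_status CODE
# Succeeds only if the given HTTP status code is exactly 200.
is_healthy_status() {
  [ "$1" = "200" ]
}

# probe_health CURL_ARGS...
# Hypothetical helper: runs curl with the given arguments (URL, --cert, --key,
# --cacert, ...), discards the body and checks the HTTP status code.
# Returns non-zero if curl itself fails or the status is not 200.
probe_health() {
  code=$(curl --silent --output /dev/null --write-out '%{http_code}' "$@") || return 1
  is_healthy_status "$code"
}

# Example call (not executed here, paths taken from the probe above):
# probe_health --cert /var/etcd/ssl/client/client/tls.crt \
#              --key /var/etcd/ssl/client/client/tls.key \
#              --cacert /var/etcd/ssl/client/ca/ca.crt \
#              https://etcd-aws-local:8080/healthz
```

Separating the status comparison from the curl call keeps the decision logic testable without a running etcd.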

The new liveness probe is now accurate, but it still fails after a few seconds because the backup sidecar needs a long time to promote its etcd member from learner. The etcd container is then restarted, the backup sidecar's initialization fails, and a new initialization begins when the etcd container comes back up. This cycle repeats, and the etcd pods never become ready. A startup probe of 2 minutes solves the problem by allowing the initialization to complete without interruptions from etcd container restarts.
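
As a rough illustration of how a 2-minute allowance maps to probe settings: Kubernetes gives a container approximately initialDelaySeconds + periodSeconds × failureThreshold before a failing startup probe leads to a restart. The numbers below (period 5s, threshold 24) are assumptions chosen to reach 120 seconds, not the values set by this PR, and `startup_budget_seconds` is a hypothetical helper.

```shell
#!/bin/sh
# startup_budget_seconds INITIAL_DELAY PERIOD FAILURE_THRESHOLD
# Prints the approximate time (in seconds) a container has to start
# before a continuously failing startup probe causes a restart.
startup_budget_seconds() {
  echo $(( $1 + $2 * $3 ))
}

# Assumed example values: no initial delay, probe every 5s, 24 allowed failures.
budget=$(startup_budget_seconds 0 5 24)
echo "startup budget: ${budget}s"
```

Liveness checks only begin once the startup probe has succeeded, which is what shields the learner promotion from restarts during this window.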

Release note:

An issue has been fixed that caused the `liveness` and `readiness` probes of `etcd` to always succeed even though an error was reported. This prevented defective etcd pods from being restarted automatically and caused unready candidates to be considered ready to serve traffic via the `etcd service`.

A `startup` probe has been added to `etcd` to allow 2 minutes of initialization time before checking for etcd liveness.

Add support for running envtest on M1 Macbooks.

shreyas-s-rao

@timuthy Thanks for the quick fix! Overall LGTM except for a small nit.
/lgtm

pkg/component/etcd/statefulset/values_helper.go

aaronfern

Thanks for the PR @timuthy!

One comment from me

pkg/component/etcd/statefulset/values_helper.go

timuthy · 2022-08-05T12:18:12Z

Thanks for the reviews @shreyas-s-rao and @aaronfern 🚀 I fixed the typo, PTAL.

shreyas-s-rao · 2022-08-05T13:24:36Z

@timuthy the unit tests for the sts component are failing because the probes in the validateEtcd and validateEtcdWithDefaults functions in the test need to be adapted. Can you please make that change? Thanks!

shreyas-s-rao · 2022-08-05T13:39:57Z

/hold

Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>

aaronfern

/lgtm

The enablement of startup/liveness probes through gardener#396 showed that they cause more harm than good:
- The startup time of etcds can vary depending on the state and amount of data.
- If startup does not happen in the expected time, the failing probes kill the container, which does not help to solve the issue at all but ends in an endless loop of restarts.
- Liveness probes had been disabled for a long time before, which never caused issues in our experience.
- Other communities have come to a similar conclusion, see https://github.com/improbable-eng/etcd-cluster-operator/blob/master/docs/operations.md#why-arent-there-liveness-probes-for-the-etcd-pods

The enablement of startup/liveness probes through #396 showed that they cause more harm than good:
- The startup time of etcds can vary depending on the state and amount of data.
- If startup does not happen in the expected time, the failing probes kill the container, which does not help to solve the issue at all but ends in an endless loop of restarts.
- Liveness probes had been disabled for a long time before, which never caused issues in our experience.
- Other communities have come to a similar conclusion, see https://github.com/improbable-eng/etcd-cluster-operator/blob/master/docs/operations.md#why-arent-there-liveness-probes-for-the-etcd-pods

Co-authored-by: Tim Usner <tim.usner@sap.com>

Fix liveness and readiness checks

5c11f70

timuthy requested a review from a team as a code owner August 5, 2022 10:56

gardener-robot added area/quality Output qualification (tests, checks, scans, automation in general, etc.) related kind/bug Bug needs/review Needs review size/m Size of pull request is medium (see gardener-robot robot/bots/size.py) labels Aug 5, 2022

gardener-robot-ci-1 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 5, 2022

timuthy added this to the v0.12.0 milestone Aug 5, 2022

gardener-robot-ci-1 added needs/ok-to-test Needs approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 5, 2022

shreyas-s-rao approved these changes Aug 5, 2022

pkg/component/etcd/statefulset/values_helper.go (outdated)

gardener-robot added reviewed/lgtm Has approval for merging and removed needs/review Needs review labels Aug 5, 2022

aaronfern requested changes Aug 5, 2022

pkg/component/etcd/statefulset/values_helper.go (outdated)

Fix typo

aa637fe

gardener-robot-ci-2 added the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 5, 2022

gardener-robot-ci-3 removed the reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) label Aug 5, 2022

gardener-robot added the reviewed/do-not-merge Has no approval for merging as it may break things, be of poor quality or have (ext.) dependencies label Aug 5, 2022

Fix integration tests

709ec06

gardener-robot-ci-1 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 5, 2022

Support envtest on M1 Macbooks

a1871f7

Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>

Add startup probe to etcd pods

873c059

Signed-off-by: Shreyas Rao <shreyas.sriganesh.rao@sap.com>

gardener-robot-ci-3 added reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) and removed reviewed/ok-to-test Has approval for testing (check PR in detail before setting this label because PR is run on CI/CD) labels Aug 6, 2022

aaronfern approved these changes Aug 6, 2022

gardener-robot added reviewed/lgtm Has approval for merging and removed needs/second-opinion Needs second review by someone else labels Aug 6, 2022

abdasgupta approved these changes Aug 8, 2022

shreyas-s-rao merged commit 0ed5c73 into gardener:master Aug 8, 2022

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Aug 8, 2022

timuthy deleted the fix.probes branch August 15, 2022 09:18

timuthy mentioned this pull request Sep 1, 2022

Remove startup and liveness probes #423

Merged

aaronfern mentioned this pull request Sep 1, 2022

Remove startup and liveness probes #424

Merged

ishan16696 mentioned this pull request Sep 22, 2022

[Feature] Liveness Probe on multi-node etcd #280

Open
